
    Massive Data-Centric Parallelism in the Chiplet Era

    Traditionally, massively parallel applications are executed on distributed systems, where computing nodes are distant enough that the parallelization schemes must minimize communication and synchronization to achieve scalability. Mapping communication-intensive workloads to distributed systems requires complicated problem partitioning and dataset pre-processing. With the current AI-driven trend of having thousands of interconnected processors per chip, there is an opportunity to rethink these communication-bottlenecked workloads. This bottleneck often arises from data structure traversals, which cause irregular memory accesses and poor cache locality. Recent works have introduced task-based parallelization schemes to accelerate graph traversal and other sparse workloads. Data structure traversals are split into tasks and pipelined across processing units (PUs). Dalorex demonstrated the highest scalability (up to thousands of PUs on a single chip) by keeping the entire dataset on-chip, scattered across PUs, and executing each task at the PU where its data is local. However, it also raised questions about how to scale to larger datasets when all the memory is on-chip, and at what cost. To address these challenges, we propose a scalable architecture composed of a grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time reconfiguration enables creating chip products that optimize for different target metrics, such as time-to-solution, energy, or cost, while software reconfigurations avoid network saturation when scaling to millions of PUs across many chip packages. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of data-local execution at scale. Our parallelization of Breadth-First Search with RMAT-26 across a million PUs reaches 3323 GTEPS.
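
    As a rough illustration of the task-based, data-local execution scheme described in this abstract (a sketch, not the Dalorex or DCRA implementation), the Python snippet below scatters a toy graph across a handful of PUs, turns each traversal step into a small task, and routes every task to the PU that owns the vertex it touches. The PU count, the owner() partitioning, and all function names are illustrative assumptions.

        from collections import deque

        NUM_PUS = 4  # illustrative; the paper scales this idea to ~a million PUs

        def owner(vertex):
            # Static partitioning: each vertex's adjacency list and visited state live on one PU.
            return vertex % NUM_PUS

        def data_local_bfs(adjacency, root):
            # Per-PU state: the graph slice each PU owns, its parent map, and its task queue.
            local_adj = [{v: n for v, n in adjacency.items() if owner(v) == pu}
                         for pu in range(NUM_PUS)]
            local_parent = [dict() for _ in range(NUM_PUS)]
            queues = [deque() for _ in range(NUM_PUS)]

            queues[owner(root)].append((root, root))    # task = (vertex, parent), sent to the owner PU
            while any(queues):
                for pu in range(NUM_PUS):               # round-robin loop stands in for parallel PUs,
                    if not queues[pu]:                  # so strict level order is not preserved
                        continue
                    v, p = queues[pu].popleft()         # the task runs where v's data is local
                    if v in local_parent[pu]:           # visited check touches only PU-local state
                        continue
                    local_parent[pu][v] = p
                    for nbr in local_adj[pu].get(v, ()):
                        queues[owner(nbr)].append((nbr, v))   # spawn a new task at nbr's owner PU
            # Merge the per-PU parent maps only to report the result.
            return {v: p for pu in range(NUM_PUS) for v, p in local_parent[pu].items()}

        # Toy usage: a 4-vertex graph traversed from vertex 0.
        graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
        print(data_local_bfs(graph, 0))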

    Microarchitectures for Heterogeneous Superconducting Quantum Computers

    Noisy Intermediate-Scale Quantum (NISQ) computing has dominated headlines in recent years, with the longer-term vision of Fault-Tolerant Quantum Computation (FTQC) offering significant potential, albeit at currently intractable resource costs and quantum error correction (QEC) overheads. For problems of interest, FTQC will require millions of physical qubits with long coherence times, high-fidelity gates, and compact sizes to surpass classical systems. Just as heterogeneous specialization has offered scaling benefits in classical computing, it is likewise gaining interest in FTQC. However, systematic use of heterogeneity in either hardware or software elements of FTQC systems remains a serious challenge due to the vast design space and variable physical constraints. This paper meets the challenge of making heterogeneous FTQC design practical by introducing HetArch, a toolbox for designing heterogeneous quantum systems, and using it to explore heterogeneous design scenarios. Using a hierarchical approach, we successively break quantum algorithms into smaller operations (akin to classical application kernels), thus greatly simplifying the design space and resulting tradeoffs. Specializing to superconducting systems, we then design optimized heterogeneous hardware composed of varied superconducting devices, abstracting physical constraints into design rules that enable devices to be assembled into standard cells optimized for specific operations. Finally, we provide a heterogeneous design space exploration framework which reduces the simulation burden by a factor of 10^4 or more and allows us to characterize optimal design points. We use these techniques to design superconducting quantum modules for entanglement distillation, error correction, and code teleportation, reducing error rates by 2.6x, 10.7x, and 3.0x compared to homogeneous systems.
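
    As a loose, hypothetical illustration of the hierarchical design-space exploration described in this abstract (none of these names come from HetArch), the Python sketch below breaks a workload into operation kernels, assigns each kernel one of several candidate standard cells from an assumed library, and searches the resulting space with a crude additive error model standing in for detailed simulation. The cell names, error rates, areas, and the area budget are all invented for the example.

        from dataclasses import dataclass
        from itertools import product

        @dataclass(frozen=True)
        class StandardCell:
            name: str
            error_rate: float   # assumed per-use error contribution
            area: float         # assumed relative footprint

        # Illustrative cell library: several device variants per operation type.
        CELL_LIBRARY = {
            "distill":  [StandardCell("distill_fast", 1e-3, 2.0),
                         StandardCell("distill_compact", 3e-3, 1.0)],
            "correct":  [StandardCell("qec_small", 5e-4, 1.0),
                         StandardCell("qec_deep", 1e-4, 3.0)],
            "teleport": [StandardCell("teleport_basic", 2e-3, 1.5)],
        }

        def explore(operations, area_budget):
            """Pick one cell per operation type, minimizing total error within the budget."""
            best = None
            op_types = sorted(set(operations))
            for choice in product(*(CELL_LIBRARY[t] for t in op_types)):
                cells = dict(zip(op_types, choice))
                area = sum(c.area for c in cells.values())
                if area > area_budget:
                    continue
                # Crude additive error model in place of full device simulation.
                error = sum(cells[t].error_rate for t in operations)
                if best is None or error < best[0]:
                    best = (error, cells)
            return best

        # Usage: a toy "algorithm" expressed as a sequence of operation kernels.
        workload = ["distill", "correct", "correct", "teleport"]
        print(explore(workload, area_budget=6.0))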